Which chemical properties influence the quality of red wines?
## [1] 1599 14
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality" "quality.factor"
## 'data.frame': 1599 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ quality.factor : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
## $title
## [1] "Quality (Ranking)"
##
## attr(,"class")
## [1] "labels"
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
The vast majority of the wines were rated 5 or 6, and each rating further from 5 and 6 has roughly an order of magnitude fewer wines.
Overall acidity, alcohol level, and fixed acidity are all roughly normally distributed, with positive skew in alcohol and fixed acidity.
Citric acid is 0 in a large number of wines, and its distribution is positively skewed and noticeably flat.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Each acidity measurement is distributed differently. Fixed acidity is positively skewed. Volatile acidity is less strongly skewed but has some positive outliers. Citric acid has an almost exponential distribution.
## [1] 0.079
There are several major outliers in the chloride measurements, but around the mean and median the distribution is normal.
Both measures of sulfur dioxide (SO2) are positively skewed. According to the description of the data, “SO2 concentrations over 50 ppm [are] evident in the nose and taste of wine.” It may be interesting later to see what effect this evident taste has on a wine’s rating.
# Number of entries with total sulfur dioxide at or above 50 ppm.
dim(subset(rw, total.sulfur.dioxide >= 50))[1]
## [1] 557
557 / 1599
## [1] 0.3483427
Of the 1599 entries, 557 have total SO2 of 50 ppm or more. This is about 34.8% of the entries.
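The same share can be computed in one step by averaging a logical vector. A minimal sketch, assuming only the ten total.sulfur.dioxide values printed by str() above (the vector name `so2_first10` is illustrative, and ten rows will not reproduce the full-data figure):

```r
# First ten total.sulfur.dioxide values, copied from the str() output above.
so2_first10 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# A comparison yields TRUE/FALSE; sum() counts the TRUEs, mean() gives the share.
n_at_or_above_50 <- sum(so2_first10 >= 50)
share_at_or_above_50 <- mean(so2_first10 >= 50)
```

On the full data frame the equivalent would be `mean(rw$total.sulfur.dioxide >= 50)`.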
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Residual sugar is roughly normally distributed, with outliers.
This seems to be a recurring pattern. Later it may be interesting to compare the tails with each other to see if there is a correlation between the extremes and good or bad wine.
Density is tightly distributed with a normal distribution showing no apparent skewness.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphates are positively skewed with extreme positive outliers. Also, the entries are fairly spread out. According to the data file, sulphates “can contribute to sulfur dioxide gas”.
Sulphates and total sulfur dioxide have similar tails, but total sulfur dioxide is not as close to normal as sulphates.
Many of the variables are roughly normally distributed. However, with the exception of density and alcohol, the distributions are skewed or have extreme outliers. The interesting thing about this is that several of the distributions appear to be skewed in the same direction. Upon further investigation it also seems that they are related chemically (sulphates and SO2).
These variables should be of interest when comparing multiple variables, so that we can test whether they are actually correlated (and how each pair of variables affects quality).
The guiding question is about the quality of wines. All other measurements are about the chemical or physical characteristics of the wines. Therefore, to answer the question, the variables will have to be compared to quality individually and eventually in groups. Overall, understanding each measurement's effect on a wine and then comparing how that affects the rated quality will be very important in the following (bivariate and multivariate) analysis.
The ratings were all very tightly packed, and I had a feeling that the number of wines would drop by an order of magnitude with each rating higher or lower than 5 or 6. This was shown to be approximately true by changing the y-axis from continuous to log10.
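The order-of-magnitude pattern can be checked directly on the counts from the rating table above. A rough sketch in base R (the names `rating_counts`, `log_counts`, and `step_ratios` are illustrative); on a log10 scale, a difference of one unit is a tenfold change in count:

```r
# Wines per quality rating, copied from the table() output above.
rating_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# log10 of each count: a difference of 1 between ratings means a tenfold change.
log_counts <- log10(rating_counts)

# Ratio of each rating's count to the count of the next rating up.
step_ratios <- rating_counts[-length(rating_counts)] / rating_counts[-1]

print(round(log_counts, 2))
print(round(step_ratios, 2))
```

In ggplot2, the same view of the bar chart comes from adding `scale_y_log10()` to the plot.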
The first chlorides graph I made was very spread out with noticeable outliers, but also showed signs of a tail. To investigate further I changed the x-axis from continuous to log10. The new graph was better, but there did not appear to be a tail as I had expected, and the majority of the data seemed to be distributed across a fairly small range. Therefore, I decided to simply remove the outliers (anything above the 98% quantile) and keep the x-axis continuous. The final graph shows what I had expected after seeing the log graph: a normal distribution over a small range, with a trickle of outliers that did not represent a significant positive skewness.
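Trimming by quantile, as described for chlorides, might look like the following. This is a sketch on simulated data rather than the wine data: the vector `chlorides_sim` and its distribution parameters are made up for illustration.

```r
set.seed(42)  # make the simulated values reproducible

# Simulated stand-in for a right-skewed measurement (not the real chlorides data).
chlorides_sim <- rlnorm(1000, meanlog = log(0.08), sdlog = 0.4)

# Keep only values at or below the 98% quantile.
cutoff <- quantile(chlorides_sim, 0.98)
trimmed <- chlorides_sim[chlorides_sim <= cutoff]
```

With the real data, the same filter would be `subset(rw, chlorides <= quantile(chlorides, 0.98))`.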
Total sulfur dioxide (SO2) was positively skewed, similar to free SO2, but had a couple of extreme outliers. To understand the main distribution of total SO2 I removed the values that were greater than the 99% quantile.
Residual sugars were adjusted by removing the values greater than the 99% quantile and by log transforming the x-axis (done separately). The log and quantile graphs showed the same distribution for the most part, but the quantile graph was easier to understand because of the uniform breaks.
Sulphates had extreme outliers, which were removed so that it was easier to understand the ranges within the main distribution.
No single variable has a clear correlation with quality (e.g., correlation > 0.5).
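One way to scan for such correlations is to compute each variable's Pearson correlation with quality. A minimal sketch, assuming only the first ten rows printed by str() above (the data frame `rw_first10` is illustrative, and ten rows say nothing reliable about the full data set):

```r
# First ten rows of three variables plus quality, copied from the str() output.
rw_first10 <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation of every non-quality column with quality.
predictors <- setdiff(names(rw_first10), "quality")
cor_with_quality <- sapply(rw_first10[predictors], cor, y = rw_first10$quality)

print(round(cor_with_quality, 2))
```

Run against the full `rw` data frame, the same pattern gives one coefficient per variable to compare against the 0.5 threshold.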
As quality goes up the sulphates seem to increase slightly, but there is a lot of overlap between the distributions.
It appears that the percentage of the SO2 that is free doesn’t have a noticeable effect on quality.
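The share of SO2 that is free can be derived as the ratio of the two sulfur dioxide columns. A sketch on the first ten rows from the str() output above (`free_share` is an illustrative name, not a column in the data set):

```r
# First ten free and total sulfur dioxide values, copied from the str() output.
free_so2 <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Fraction of total SO2 that is free, one value per wine.
free_share <- free_so2 / total_so2

print(round(free_share, 3))
```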
The data at the extreme qualities is rather limited. This makes it hard to pick up on possible trends or correlations. Looking at the graphs, it appears that the lower quality citric acid distributions are shifted further left than the higher quality distributions, but there are exceptions (i.e., wines rated 7 and 8).
This plot shows the best correlation between a variable and quality so far, though it is far from perfect. For instance, in the lower rankings, increasing alcohol does not correlate with a better ranking. In the multivariate analysis I should check whether another variable combined with a respectable amount of alcohol makes for a bad wine.
Chlorides may be that other variable I test against.
Both graphs have a blob-like scatter. There is no clear trend between just the two variables. It would be interesting to see if a third variable would order or separate these two variables.
This is a must-explore with quality colored in. Volatile acidity adds a bad vinegar taste in high amounts, and citric acid in low (but nonzero) amounts adds freshness. I would hypothesize that the points on the left will be ranked far lower than the points on the right.
It appears that volatile acidity has an effect on quality. As the quality increases, the peak of each distribution shifts toward lower volatile acidity.
As sulphates increased there seemed to be an increase in quality. Although it wasn’t perfect, there did seem to be a noticeable trend.
As alcohol increased there was an increase in quality. This relationship was much more pronounced than for sulphates, and there were fewer extreme outliers in the alcohol distributions than in the sulphates distributions.
Finally, the less volatile acidity in the wine, the better the rating it got. This is not universal, since there are overlaps between the quality distributions, but it is noticeable.
As citric acid increased there was a decrease in volatile acidity. Therefore, I thought that there would be a correlation between citric acid and quality. However, there was no correlation between citric acid and quality.
I wonder if the wines with low volatile acidity that were not ranked high had high citric acid (possibly too high, thus driving down the volatile acidity artificially).
Alcohol and volatile acidity tie for the strongest relationship with quality.
Nothing stands out in the graphs above.
There doesn’t seem to be any grouping between volatile and fixed acidity. In addition, there is no clear pattern with respect to quality in the relationship either.
Nothing stands out in the graph above.
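library(ggplot2)
# Quality counts for all wines (red) overlaid with two subsets (blue):
# above-median citric acid with below-median volatile acidity, and the
# stricter top-quartile citric acid with bottom-quartile volatile acidity.
ggplot(data = subset(rw, citric.acid > quantile(citric.acid, .5) &
                         volatile.acidity < quantile(volatile.acidity, .5)),
       aes(quality.factor)) +
  geom_bar(fill = 'blue') +
  geom_bar(data = rw, fill = 'red', alpha = .5) +
  geom_bar(data = subset(rw, citric.acid > quantile(citric.acid, .75) &
                             volatile.acidity < quantile(volatile.acidity, .25)),
           fill = 'blue')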